2. Association and Correlation

Interrelationships between variables

Preparatory work

  • Read: OpenIntro Statistics Sections 1.2.3, 1.2.4, and 1.2.5
  • Review Vocabulary: (next slide)

Vocabulary for relationships

  • Dependent / Associated / Correlated: Some statistical relationship exists
  • Positive Correlation: Higher x values linked to higher y values
  • Negative Correlation: Higher x values linked to lower y values
  • Independent: The variables are unrelated. No relationship exists.
  • Explanatory and Response variables: Variables that cause an effect (explanatory) or are affected (response), or are expected to.
  • Correlation Coefficient \(\rho\): A numerical measurement of correlation. The output of df.corr()

Using vocabulary / testing understanding

Based on your best understanding or best guess…

  1. Among students, weekly study time and GPA are likely _____
  2. Among mammal species, heart rate and mass are likely _____
  3. Among American football players, #tackles and #rushing yards are likely _____
  4. Among people, eye color and ice cream preference are likely _____
  5. Among Americans, income and political party affiliation are likely _____
  6. Year by year, global average temperatures and the Dow Jones Industrial Average are likely _____

Correlation Coefficient

Just how correlated are two numerical variables?

The correlation coefficient \(\rho\) measures strength and direction of correlation.

Use df.corr(numeric_only=True) to calculate.
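As a minimal sketch of that call, the example below builds a small table and prints its correlation matrix. The column names and every number are invented for illustration only; they are not taken from the course data.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame({
    "study_hours": [2, 4, 6, 8, 10],
    "gpa": [2.1, 2.6, 3.0, 3.3, 3.9],
    "absences": [9, 7, 6, 3, 1],
    "name": ["Ana", "Ben", "Cy", "Di", "Eli"],  # text column, not numeric
})

# numeric_only=True tells pandas to skip the text column
# instead of trying to correlate it.
corr = df.corr(numeric_only=True)

# Each cell holds the correlation coefficient for one pair of columns.
print(corr.round(2))
```

The result is a square table with one row and one column per numeric variable. The diagonal is always 1, because every variable is perfectly correlated with itself.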

\(\rho\) nearly 1: x and y are strongly positively correlated.

\(\rho\) nearly 0: x and y are not linearly correlated. They may be independent, or they may be related in a nonlinear way.

\(\rho\) nearly -1: x and y are strongly negatively correlated.

Correlations quickly flag trends and interrelationships for further study.
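A coefficient near 0 rules out a linear trend, not every relationship. The sketch below uses made-up values where y is completely determined by x, yet the coefficient is 0 because the relationship is a curve rather than a line.

```python
import pandas as pd

# Made-up values: x is symmetric around 0 and y is exactly x squared.
x = pd.Series([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2

# Series.corr computes the (Pearson) correlation coefficient of two columns.
rho = x.corr(y)

# y depends entirely on x, but the left half of the curve falls
# and the right half rises, so the linear trend cancels out.
print(rho)
```

This is one reason to look at a scatterplot before trusting a coefficient near 0.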

Scatterplots and correlations

A scatterplot is a 2D graph depicting each data case as a point positioned on the \(x\)-axis according to one variable and on the \(y\)-axis according to another. Like most graphs, scatterplots primarily target sighted users.

A scatterplot showing a strong, positive, linear correlation between weight (x-axis) and desired weight (y-axis). Most data points cluster tightly along a diagonal line.
Figure 1: Scatterplot (weight vs desired weight) showing strong positive correlation (\(\rho = 0.74\))

A scatterplot showing no significant correlation between age (x-axis) and weight (y-axis). Data points are broadly scattered in a cloud with no trend line.
Figure 2: Scatterplot (age vs weight) showing no significant correlation (\(\rho = 0.09\))

A scatterplot showing a weak negative correlation between age (x-axis) and height (y-axis). Data points are broadly scattered with a slight downward trend.
Figure 3: Scatterplot (age vs height) showing weak negative correlation (\(\rho = -0.30\))
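A plot of this kind can be sketched with pandas' built-in plotting, assuming matplotlib is installed. The weights below are invented for illustration and are not the data behind the figures; the file name scatter.png is also just a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the script runs without a display
import pandas as pd

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame({
    "weight": [120, 150, 160, 180, 200, 220],
    "desired_weight": [115, 140, 150, 165, 180, 190],
})

# One point per row: x position from one column, y position from the other.
ax = df.plot.scatter(x="weight", y="desired_weight")

# Put the correlation coefficient in the title so the picture and number agree.
rho = df["weight"].corr(df["desired_weight"])
ax.set_title(f"rho = {rho:.2f}")

# Save the figure to a file (placeholder name).
ax.figure.savefig("scatter.png")
```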

Fisher’s Irises

Sir Ronald Fisher, British statistician and geneticist, introduced his now famous Iris data in 1936, with 150 cases involving three very similar species: Iris setosa, Iris versicolor, and Iris virginica.

His dataset is often used as the “hello world” of data exploration.

Correlations can change when data is regrouped.

A negative correlation

Scatterplot showing a weak negative correlation between petal width and sepal width

3 positive correlations

Scatterplot showing three strong positive correlations between petal width and sepal width, one for each species
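The same reversal can be reproduced with a few lines of pandas. This is a sketch on made-up numbers, not Fisher's measurements: the column names x, y, and group are hypothetical stand-ins for two measurements and a species label.

```python
import pandas as pd

# Made-up data: within each group y rises with x,
# but groups with larger x values sit lower on the y-axis.
df = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "y": [10, 11, 11, 13, 6, 7, 8, 8, 1, 3, 3, 4],
})

# Correlation over all rows, ignoring the groups.
overall = df["x"].corr(df["y"])

# Correlation computed separately inside each group.
within = {name: g["x"].corr(g["y"]) for name, g in df.groupby("group")}

print("overall:", round(overall, 2))
for name, rho in within.items():
    print(name, round(rho, 2))
```

With these numbers the overall coefficient is negative while all three within-group coefficients are positive, so the sign of a correlation depends on how the cases are grouped.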

Types of statistical association

Matt says “Education level and political party affiliation are positively associated.” Explain why Matt must be wrong, regardless of politics.

Fiona says “Learning piano has no bearing on hair color, and hair color does not influence piano interest or ability. Therefore hair color and piano skill are statistically independent.” Find the flaw in her logic.

Individual work

Solve five problems on WeBWorK under “2. Association and Correlation.”

As always, find the preparatory work for the next slide deck and do it before class.